Introduction - Predicting Train Delays

As one blog post puts it, NJ Transit has very complex dynamics and affects tens of thousands of people every day, and the same is true of numerous other transit systems. Many people rely on these systems for their daily commute, and delays can leave them irritated and dissatisfied with the service. Delays that lead to missed connecting trains or flights, missed opportunities, and so on may drive away customers, who are a key source of revenue. One way to deal with this is to make sure that all trains run on time; but in this non-ideal world, such perfection is very difficult to maintain given the many unforeseen circumstances involved. This is where train delay prediction comes into the picture. Many apps track the live status of trains, but such applications do not help users plan their trips early on.

As mentioned above, train delays are a cause of concern for many commuters because they seem random, or at least that is how they appear to some people. Train delays can certainly be predicted to some extent, provided there is up-to-date, accurate data from which to learn patterns. Often, delays stem from events such as fallen trees blocking the track or technical issues with the train, which can be approximated with the help of certain variables; we capitalize on this in this project. The predictions can be used either by commuters or by the authorities. The application will help users commute more reliably and effectively, allowing them to plan hours or days before their journey and make informed decisions: they can schedule trips efficiently, choosing days with a low probability of delay if their appointments are flexible, or taking earlier trains to reach their destinations on time. The predictions could also be used by the authorities to counter expected delays, but here we focus on designing an application that helps commuters.

The link to our YouTube video: here

Setup

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(caret)
library(osmdata)
library(gifski)
library(gganimate)
library(mapview)
library(dplyr)
library(stargazer)

set.seed(42)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

# Note: this overrides the mapTheme() function defined above.
mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = ggplot2::margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette3 <- c("#D2FBD4","#92BCAB","#527D82")
palette2 <- c("#6baed6","#08519c")

palette55 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    # formatC wraps the quantiles; the probabilities and na.rm belong
    # inside quantile(), not as stray arguments to as.character()
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}

Import Data

We are using multiple datasets to account for as many factors as possible. The Kaggle Train Delay dataset contains Amtrak and NJ Transit delay data. We also use the New Jersey Stations dataset for station coordinates. We loaded census data to account for any socio-economic factors that affect train timings, and weather data to check the influence of rain/snow, wind and temperature on train delays. More information about the datasets used is included in the following sections.

Kaggle Train Delay dataset

The Kaggle Train Delay dataset contains data for Amtrak and NJ Transit trains. Since the Amtrak records lack delay information, our analysis uses only the NJ Transit data. For each train we have the line it runs on and the stations it serves, in stop order. Train IDs are also included, and we can observe that most trains run daily, weekly or bi-weekly. We use the data from April 2018 (weeks 13 to 18 of the year).

# dat_2018_03 = read.csv('archive/2018_03.csv')
# dat_2018_04 = read.csv('archive/2018_04.csv')
# dat <- rbind(dat_2018_03, dat_2018_04)

dat <- read.csv('archive/2018_04.csv')
dat <- dat %>%
  mutate(line = tolower(line),
         scheduled_time =  as_datetime(scheduled_time),
         actual_time = as_datetime(actual_time),
         date = as_date(scheduled_time),
         interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
         hour = hour(scheduled_time),
         dotw =  wday(scheduled_time, label=TRUE),
         week = week(scheduled_time)) %>% distinct()

#unique(dat %>% select(week, date, dotw))
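One quick way to see the schedule patterns mentioned above is to tabulate the number of distinct service days per train. This is a sketch using the columns created in the chunk above:

```r
# Count how many distinct dates each train_id appears on, then tabulate.
# Counts near 30 suggest daily service in April; lower counts suggest
# weekend-only or weekly service.
dat %>%
  group_by(train_id) %>%
  summarise(service_days = n_distinct(date)) %>%
  count(service_days)
```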

New Jersey Rail Stations dataset

The Kaggle train delay dataset does not contain coordinate information, so the NJ Stations dataset is used to get the coordinates for these stations. We do not consider Amtrak stations, as delay data is not available for them in the Kaggle dataset.

#jlines <- st_read('Passenger_Railroad_Lines_in_NJ-shp/Passenger_Railroad_Lines_in_NJ.shp') %>%
  #st_transform('EPSG:2236')

njstations <- st_read('Railroad_Stations_in_NJ-shp/Railroad_Stations_in_NJ.shp') %>%
  st_transform('EPSG:2236') %>%
  filter(AMTRAK != "Y") %>%
  select(STATION, COUNTY, geometry, RAIL_LINE, LATITUDE, LONGITUDE) 

njstations.first <- njstations[match(unique(njstations$STATION), njstations$STATION),]

amtrak <- st_read('Amtrak_Stations-shp/Amtrak_Stations.shp') %>%
  st_transform('EPSG:2236')

Standardising Rail Line Names

Some of the rail line names differ slightly even though they refer to the same line. Here, we standardize them.

dat <- dat %>%
  mutate(new_line_name_geo = case_when(
                  line == 'atl. city line' ~ 'ATLANTIC CITY RAIL LINE',
                  line == 'bergen co. line' ~ 'BERGEN COUNTY LINE',
                  line == 'main line' ~ 'MAIN LINE',
                  line == 'montclair-boonton' ~ 'MONTCLAIR BOONTON LINE',
                  line == 'morristown line' ~ 'MORRIS & ESSEX',
                  line == 'no jersey coast' ~ 'NORTH JERSEY COAST LINE',
                  line == 'northeast corrdr' ~ 'NORTHEAST CORRIDOR LINE',
                  line == 'pascack valley' ~ 'PASCACK VALLEY LINE',
                  line == 'raritan valley' ~ 'RARITAN VALLEY LINE',
                  ))

dat_nj <- subset(dat, type != 'Amtrak') %>% na.omit() 

dat_njstation <- dat_nj %>%
  left_join(njstations.first, by = c('to' = 'STATION'))

nas <- dat_njstation %>%
  filter (is.na(COUNTY))

Getting the Coordinates for Stations

There are around 26 stations with missing latitude, longitude and geometry even after the join, due to mismatched rail line or station names. This section uses the coordinates from the NJ Stations data to fill in the missing values in the main dataset. We also retain the original version of the coordinates, as we will use it to compute the distance between train stations, which is one of the features of our prediction model.

#miss <- dat_njstation[(is.na(dat_njstation$RAIL_LINE)), ] %>% select(to) 
dat_njstation <- dat_njstation %>%
  mutate(LATITUDE_map = case_when(
                  to == "Middletown NY" ~ 41.45734,
                  to == "Ramsey Route 17" ~ 41.07505,
                  to == "Ramsey Main St" ~ 41.05691,
                  to == "Glen Rock Main Line" ~ 40.96225,
                  to == "Secaucus Lower Lvl" ~ 40.76119,
                  to == "Bay Street" ~ 40.80818,
                  to == "Watsessing Avenue" ~ 40.78271,
                  to == "Anderson Street" ~ 40.89446,
                  to == "Essex Street" ~ 40.71284,
                  to == "Wood Ridge" ~ 40.98067,
                  to == "Princeton Junction" ~ 40.31632,
                  to == "New Brunswick" ~ 40.49732,
                  to == "Metropark" ~ 40.56858,
                  to == "Newark Airport" ~ 40.70442,
                  to == "Secaucus Upper Lvl" ~ 40.76119,
                  to == "New York Penn Station" ~ 40.75005,
                  to == "Philadelphia" ~ 39.95655,
                  to == "Pennsauken" ~ 39.9783,
                  to == "Atlantic City Rail Terminal" ~ 39.36330,
                  to == "Point Pleasant Beach" ~ 40.09272,
                  to == "Middletown NJ" ~ 40.38978,
                  to == "Montclair State U" ~ 40.86978,
                  to == "Mount Arlington" ~ 40.89659,
                  to == "Mountain View" ~ 40.91398,
                  to == "Wayne-Route 23" ~ 40.90026,
                  to == "Teterboro" ~ 40.86489),
         LONGITUDE_map = case_when(
                  to == "Middletown NY" ~ -74.37036,
                  to == "Ramsey Route 17" ~ -74.14549,
                  to == "Ramsey Main St" ~ -74.14205,
                  to == "Glen Rock Main Line" ~ -74.13345,
                  to == "Secaucus Lower Lvl" ~ -74.07582,
                  to == "Bay Street" ~ -74.20868,
                  to == "Watsessing Avenue" ~ -74.19844,
                  to == "Anderson Street" ~ -74.04378,
                  to == "Essex Street" ~ -74.03608,
                  to == "Wood Ridge" ~ -74.12057,
                  to == "Princeton Junction" ~ -74.62375,
                  to == "New Brunswick" ~ -74.44565,
                  to == "Metropark" ~ -74.32933,
                  to == "Newark Airport" ~ -74.19071,
                  to == "Secaucus Upper Lvl" ~ -74.07582,
                  to == "New York Penn Station" ~ -73.99236,
                  to == "Philadelphia" ~ -75.18142,
                  to == "Pennsauken" ~ -75.06197,
                  to == "Atlantic City Rail Terminal" ~ -74.44148,
                  to == "Point Pleasant Beach" ~ -74.04819,
                  to == "Middletown NJ" ~ -74.11613,
                  to == "Montclair State U" ~ -74.19743,
                  to == "Mount Arlington" ~ -74.63273,
                  to == "Mountain View" ~ -74.26781,
                  to == "Wayne-Route 23" ~ -74.25697,
                  to == "Teterboro" ~ -74.06265)) 



dat_njstation <- dat_njstation %>%
  mutate(LATITUDE_geo = case_when(
                  to == "Middletown NY" ~ 2473857.10443312,
                  to == "Ramsey Route 17" ~ 2546600.43904687,
                  to == "Ramsey Main St" ~ 2548071.213794,
                  to == "Glen Rock Main Line" ~ 2553173.27811609,
                  to == "Secaucus Lower Lvl" ~ 2574931.67533377,
                  to == "Bay Street" ~ 2536756.2007145,
                  to == "Watsessing Avenue" ~ 2540316.10362768,
                  to == "Anderson Street" ~ 2579939.88324208,
                  to == "Essex Street" ~ 2587358.14925183,
                  to == "Wood Ridge" ~ 2571714.38784559,
                  to == "Princeton Junction" ~ 2424325.92246776,
                  to == "New Brunswick" ~ 2479578.73529207,
                  to == "Metropark" ~ 2509985.86462867,
                  to == "Newark Airport" ~ 2544684.21753966,
                  to == "Secaucus Upper Lvl" ~ 2553173.27811609,
                  to == "New York Penn Station" ~ 2598402.32682548,
                  to == "Philadelphia" ~ 2287748.12653078,
                  to == "Pennsauken" ~ 2320796.78031357,
                  to == "Atlantic City Rail Terminal" ~ 2511259.84277974,
                  to == "Point Pleasant Beach" ~ 2601870.52191446,
                  to == "Middletown NJ" ~ 2574389.58056374,
                  to == "Montclair State U" ~ 2538119.2934849,
                  to == "Mount Arlington" ~ 2416905.89259235,
                  to == "Mountain View" ~ 2517392.72564527,
                  to == "Wayne-Route 23" ~ 2520776.28615481, 
                  to == "Teterboro" ~ 2575574.40229659), 
          LONGITUDE_geo = case_when(
                  to == "Middletown NY" ~ 6300237.08919109,
                  to == "Ramsey Route 17" ~ 6165652.3195194,
                  to == "Ramsey Main St" ~ 6159109.81628919,
                  to == "Glen Rock Main Line" ~ 6124775.82298926,
                  to == "Secaucus Lower Lvl" ~ 6052714.51453407,
                  to == "Bay Street" ~ 6066957.35645596,
                  to == "Watsessing Avenue" ~ 6057891.02845568,
                  to == "Anderson Street" ~ 6102023.51420209,
                  to == "Essex Street" ~ 6035957.42686259,
                  to == "Wood Ridge" ~ 6082801.75680131,
                  to == "Princeton Junction" ~ 5887634.71567872,
                  to == "New Brunswick" ~ 5948577.7928019,
                  to == "Metropark" ~ 5977000.95892906,
                  to == "Newark Airport" ~6029503.90081312,
                  to == "Secaucus Upper Lvl" ~6052714.51453407,
                  to == "New York Penn Station" ~ 6050498.84576351,
                  to == "Philadelphia" ~ 5737001.50509207,
                  to == "Pennsauken" ~ 5746956.61254804,
                  to == "Atlantic City Rail Terminal" ~ 5535112.08561287,
                  to == "Point Pleasant Beach" ~ 5809520.88307356,
                  to == "Middletown NJ" ~ 5916376.13106645,
                  to == "Montclair State U" ~ 6089666.71633157,
                  to == "Mount Arlington" ~ 6090338.96278667,
                  to == "Mountain View" ~ 6104270.70311063,
                  to == "Wayne-Route 23" ~ 6099501.47722319,
                  to == "Teterboro" ~ 6090822.97149897))
# st_coordinates() returns X (easting) and Y (northing); the assignment
# below follows the LATITUDE/LONGITUDE naming convention used above.
coords <- as.data.frame(st_coordinates(st_as_sf(dat_njstation)))
dat_njstation$LATITUDE_exist = coords$X
dat_njstation$LONGITUDE_exist = coords$Y

dat_njstation <- dat_njstation %>%
  mutate(LONGITUDE_geo = ifelse(is.na(LONGITUDE_geo),LONGITUDE_exist,LONGITUDE_geo),
         LATITUDE_geo = ifelse(is.na(LATITUDE_geo),LATITUDE_exist,LATITUDE_geo),
         LONGITUDE_map = ifelse(is.na(LONGITUDE_map),LONGITUDE,LONGITUDE_map),
         LATITUDE_map = ifelse(is.na(LATITUDE_map),LATITUDE,LATITUDE_map)) %>%
  select(-LATITUDE_exist,-LONGITUDE_exist,-LATITUDE,-LONGITUDE)

dat_njstation <- dat_njstation %>%
st_as_sf(., coords = c("LONGITUDE_geo", "LATITUDE_geo"), crs = 3701)

Census Data

Census data for the states of New Jersey, New York and Pennsylvania has been imported. We chose factors that we think may influence, directly or indirectly, the trains and their running times: total population, median income, median age, White population, travel time, number of commuters, means of transport available and total public transportation for the regions that NJ Transit covers. The plot below shows the total area covered by the census tracts; the region covered by the NJ Transit stations is a small subset of it.

census <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2018, 
          state = c("NJ",'NY','PA'),
          geometry = TRUE, 
          output = "wide")

census <- census %>%
  dplyr::rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport) %>% 
  st_transform('EPSG:2236')

subset_census <- census[st_crop(census, st_bbox(njstations)),]

# mapview(st_bbox(census)) +
#   mapview(njstations, cex = 2) +
#   mapview(st_bbox(subset_census), col.regions = "orange")

Here, we plot the subset: only those areas that are of interest to us due to their proximity to the NJ Transit stations.

ggplot() +
  geom_sf(data = subset_census %>% st_transform('EPSG:2236')) +
  geom_sf(data = njstations %>% st_transform('EPSG:2236'), color = "blue") 

Weather Data

We collect weather data for Trenton for the same duration as our original dataset (April 2018), under the assumption that overall weather patterns remain similar throughout the states. For a more granular study, it would make sense to include weather data from other locations as well. Weather helps account for certain situations that may be a potential cause of delay.

#Every hour
weather.Panel_hourly <- 
  riem_measures(station = "TTN", date_start = "2018-04-01", date_end = "2018-05-02") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
  
    group_by(interval60) %>%
    dplyr::summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel_hourly)
## Rows: 743
## Columns: 4
## $ interval60    <dttm> 2018-04-01 00:00:00, 2018-04-01 01:00:00, 2018-04-01 0…
## $ Temperature   <dbl> 48.0, 46.0, 44.1, 42.1, 42.1, 43.0, 43.0, 45.0, 46.9, 4…
## $ Precipitation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Wind_Speed    <dbl> 13, 13, 11, 10, 9, 10, 10, 11, 10, 11, 8, 6, 8, 9, 9, 1…
#Every day
weather.Panel_daily <- 
  riem_measures(station = "TTN", date_start = "2018-04-01", date_end = "2018-05-02") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(date(interval60)) %>%
    dplyr::summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel_daily)
## Rows: 31
## Columns: 4
## $ `date(interval60)` <date> 2018-04-01, 2018-04-02, 2018-04-03, 2018-04-04, 2…
## $ Temperature        <dbl> 61.0, 53.1, 43.0, 64.9, 45.0, 61.0, 55.0, 44.1, 46…
## $ Precipitation      <dbl> 0.0000, 1.4107, 0.5507, 1.0803, 0.0000, 0.3302, 0.…
## $ Wind_Speed         <dbl> 16, 13, 8, 25, 20, 21, 16, 16, 11, 13, 12, 20, 17,…

Time-space Panel + Weather

Here, we merge the dataset with the daily weather data, after which we can move forward and draw useful conclusions from the data we have.

dat_njstation$date <- as.Date(dat_njstation$date)
dat_njstation <- 
  dat_njstation %>%
  left_join(weather.Panel_daily, by = c('date' = 'date(interval60)'))

Exploratory Data Analysis

We perform exploratory data analysis to understand how the independent variables are related to the dependent variable (delay). The histogram below shows the frequency distribution of delay (in minutes) for the month of April 2018. We can observe that most trains fall in the lower delay range; only a few have higher delay times.

hist(dat$delay_minutes)

Descriptive Statistics

The table below shows descriptive statistics, such as the mean, standard deviation, minimum and maximum, for some of the independent variables in our dataset.

numericVars <- 
  select_if(dat_njstation %>% 
            select(date, train_id, delay_minutes, Temperature, Precipitation,
                   Wind_Speed, stop_sequence), is.numeric) %>% na.omit()
stargazer(numericVars, type = "html", title="Descriptive statistics for internal features", digits=1)
(Table: Descriptive statistics for internal features — Statistic, N, Mean, St. Dev., Min, Pctl(25), Pctl(75), Max for each numeric variable.)
quantile(dat_njstation$delay_minutes, c(.32, .57, .98))
##       32%       57%       98% 
##  1.150000  3.016667 19.380667

Weather Panel (Hourly)

The plots below show the variation of weather (precipitation, wind speed and temperature) on an hourly basis.

grid.arrange(
  ggplot(weather.Panel_hourly, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
  ggplot(weather.Panel_hourly, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel_hourly, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Trenton TTN - April, 2018")

Delay vs Precipitation (Daily)

The bar plot below shows the relation between train delay and precipitation (rain/snow). We observe that the mean delay is lower on rainy/snowy days. This may be because fewer passengers travel in bad weather (only those who need to travel are out, and the rest mostly stay put). However, the difference in means is not very significant: less than 1 minute on average.

dat_njstation %>%
  group_by(date) %>% 
  dplyr::summarize(delay_minutes = mean(delay_minutes, na.rm=T),
            Precipitation = first(Precipitation)) %>%
  mutate(isPrecip = ifelse(Precipitation > 0,"Rain/Snow", "None")) %>%
  group_by(isPrecip) %>%
  dplyr::summarize(Mean_delay = mean(delay_minutes, na.rm=T)) %>%
    ggplot(aes(isPrecip, Mean_delay)) + geom_bar(stat = "identity") +
      labs(title="Does precipitation influence delay?",
           x="Precipitation", y="Mean Delay (Minutes)") +
      plotTheme

Delay vs Wind speed (Daily)

The bar plot below shows the relation between train delay and wind speed. The average train delay appears to be higher on windy days. Strong winds are a known safety threat to trains, with effects ranging from uprooted trees and branches on the tracks to the toppling of trains; trains running on elevated tracks, in particular, reduce speed when it is too windy. Thus, wind is a reasonably good factor for explaining certain train delays caused by blocked tracks, lower train speeds, and the like. Still, the difference in mean delays is less than a minute for the month of April 2018.

dat_njstation %>%
  group_by(date) %>% 
  summarize(delay_minutes = mean(delay_minutes, na.rm=T),
            Wind_Speed = first(Wind_Speed)) %>%
  mutate(isWind = ifelse(Wind_Speed > 20,"Windy", "None")) %>%
  group_by(isWind) %>%
  summarize(Mean_delay = mean(delay_minutes, na.rm=T)) %>%
    ggplot(aes(isWind, Mean_delay)) + geom_bar(stat = "identity") +
      labs(title="Does windspeed influence delay?",
           x="Windspeed", y="Mean Delay (Minutes)") +
      plotTheme

Delay vs Temperature (Daily)

The bar plot below shows the relation between train delay and temperature. We see that higher temperatures are associated with higher delays. It is common for rail operators to slow trains down when the temperature is high, as the tracks can expand and distort; this is done to ensure safety. The difference in mean delays is again less than a minute for the month of April.

dat_njstation %>%
  group_by(date) %>% 
  summarize(delay_minutes = mean(delay_minutes, na.rm=T),
            Temperature = first(Temperature)) %>%
  mutate(isHot = ifelse(Temperature > 60,"Hot", "None")) %>%
  group_by(isHot) %>%
  summarize(Mean_delay = mean(delay_minutes, na.rm=T)) %>%
    ggplot(aes(isHot, Mean_delay)) + geom_bar(stat = "identity") +
      labs(title="Does temperature influence delay?",
           x="Temperature", y="Mean Delay (Minutes)") +
      plotTheme

Distribution of Total delay for different trains

The histogram shows the distribution of total delays for different trains, and we can clearly observe that most trains have small delays (in minutes).

delay_train <- dat_njstation %>%
  select(train_id, delay_minutes, date) %>%
  group_by(train_id, date) %>%
  dplyr::summarize(total_delay = mean(delay_minutes))

ggplot(delay_train, aes(total_delay)) +
  geom_histogram()

Distribution of Delay by Time of the Day

We can observe that most delays lie within the 15-minute range, except for overnight trains, for which the distribution is slightly different (most of them have zero delay).

dat_njstation %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, to, time_of_day) %>%
         dplyr::summarize(avg_delay = mean(delay_minutes)) %>%
  ggplot()+
  geom_histogram(aes(avg_delay), binwidth = 5)+
  labs(title="Average delay across stations for different times of the day, 2018",
       x="Delay (minutes)", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

Delay by the Day of the Week

Delays seem to be highest on Mondays. Weekends seem to have less delay, perhaps due to fewer commuters on those days; the other days of the week have almost the same distribution. From late morning to late afternoon, delays are comparatively lower on weekdays, perhaps because most working professionals have already reached their destinations by then. Thus, it is a good time to travel if one's schedule is flexible.

ggplot(
     dat_njstation %>% mutate(hour = hour(scheduled_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
    labs(title="Total Delay by day of week for NJ Transit stations, April 2018",
       x="Hour", 
       y="Delay")+
     plotTheme

Distribution across Time and Space

The plots below show how the delay varies across stations at different times of the day, on weekdays and weekends.

coords <- as.data.frame(st_coordinates(st_as_sf(dat_njstation)))
dat_njstation$LONGITUDE = coords$Y
dat_njstation$LATITUDE = coords$X

ggplot()+
  geom_sf(data = subset_census, size = 0.2)+
  geom_point(data = dat_njstation %>% st_transform('EPSG:2236') %>% 
            mutate(hour = hour(scheduled_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(from, LATITUDE, LONGITUDE, weekend, time_of_day) %>%
              dplyr::summarize(avg_delay = mean(delay_minutes)),
            aes(x=LONGITUDE, y = LATITUDE, color = avg_delay), 
            fill = "transparent", alpha = 0.5, size = 1.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Delay (minutes) at different times of the day, April 2018")

Total Delays in Minutes per Day

The plot below shows the average hourly delay over the 30-day span (month of April). We can observe some peaks suggesting unusually high average delays.

delay.Panel_hourly <- 
  dat_njstation %>%
  group_by(interval60) %>%
  dplyr::summarise(avg_delay = mean(delay_minutes))

ggplot(delay.Panel_hourly, aes(interval60, avg_delay)) + geom_line() + 
  labs(title="Average delay for 30 days span", x="Hour", y="Delay in minutes") + plotTheme

Feature Engineering

Distance between Stations

Here, we use the coordinates of the NJ Transit stations to find the distance between consecutive stations, using the ‘distm’ function from the geosphere package. It outputs distances in meters, which we convert to miles.

#unique(dat_njstation %>% filter(is.na(LATITUDE)) %>% select(from, new_line_name_geo))
# library(geosphere)
# 
# dat_njstation <- arrange(dat_njstation, date, train_id, stop_sequence)
# for(i in 1:nrow(dat_njstation)){
#     if((i %% 20000) == 0){
#       print(i)
#     }
#     row <- dat_njstation[i,]
#     if(row$from == row$to){
#       dat_njstation$dist[i] = 0
#     }
#     else{
#       prev_row <- dat_njstation[i-1,]
#       dat_njstation$dist[i] = distm (c(prev_row$LONGITUDE_map,prev_row$LATITUDE_map), c(row$LONGITUDE_map,row$LATITUDE_map), fun = distHaversine)
#     }
# }

#write.csv(as.matrix(dat_njstation %>% select(dist)), file = "dist_variable.csv")
distance <- read.csv(file = "dist_variable.csv", row.names=1)
dat_njstation <- cbind(dat_njstation, distance$X) %>%
  rename(dist = distance.X) %>%
  mutate(dist = dist/1609.344)
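As an illustration of this computation (the full loop over the dataset is shown, commented out, above), here is a minimal sketch for a single pair of stations, using the New Brunswick and Metropark coordinates from the table above:

```r
library(geosphere)

# Haversine distance between two stations, given as (longitude, latitude)
d_meters <- distm(c(-74.44565, 40.49732),   # New Brunswick
                  c(-74.32933, 40.56858),   # Metropark
                  fun = distHaversine)

d_miles <- d_meters / 1609.344  # meters to miles, as in the chunk above
```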

Number of stations for a particular route

We also consider the number of stations on a particular route as a factor that may help explain the overall delay.

## Check if same train runs multiple times on a particular day 
# dat_njstation [duplicated(dat_njstation %>% select(date, train_id, from, to)),]

dat_njstation <- dat_njstation %>% st_drop_geometry(.) %>%
  mutate(station_count = 1) %>%
  group_by(date, train_id) %>%
  summarise(total_stops = sum(station_count, na.rm=T)) %>%
  ungroup() %>%
  right_join(dat_njstation)

Lag Features

To capture how a particular train has fared in terms of delay on its previous journeys, we include lagged delay features, considering the daily, weekly and bi-weekly schedules of the trains. This helps the model make predictions based on past, train-specific data.

dat_2months <- rbind(
  read.csv('archive/2018_03.csv'),
  read.csv('archive/2018_04.csv')
)

dat_2months <- dat_2months %>%
  mutate(scheduled_time =  as_datetime(scheduled_time),
         actual_time = as_datetime(actual_time),
         date = as_date(scheduled_time),
         interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour")) %>%
  distinct()

# A first, row-wise attempt at the lag features. It is kept commented out: it is
# superseded by get_lagvars() below (which also handles dates with no matching
# prior trip), and running it live would duplicate the lag columns that are
# re-attached from CSV further down.
# start_date <- as_date('2018-04-01')
# for(i in 1:nrow(dat_njstation))
#   {
#     cur_date <- dat_njstation$date[i]
#     cur_train_id <- dat_njstation$train_id[i]
#     cur_from <- dat_njstation$from[i]
#     cur_to <- dat_njstation$to[i]
#
#     dat_2months_filter <- dat_2months %>%
#       filter(
#         date %in% c(cur_date - 1, cur_date - 2, cur_date - 7, cur_date - 14),
#         train_id == cur_train_id, from == cur_from, to == cur_to) %>%
#       select(date, delay_minutes)
#
#     if(i %% 10000 == 0){
#       print(i)
#     }
#
#     dat_njstation$lag_1day[i] <- dat_2months_filter %>%
#       filter(date == cur_date - 1) %>%
#       pull(delay_minutes)
#     dat_njstation$lag_2day[i] <- dat_2months_filter %>%
#       filter(date == cur_date - 2) %>%
#       pull(delay_minutes)
#     dat_njstation$lag_7day[i] <- dat_2months_filter %>%
#       filter(date == cur_date - 7) %>%
#       pull(delay_minutes)
#     dat_njstation$lag_14day[i] <- dat_2months_filter %>%
#       filter(date == cur_date - 14) %>%
#       pull(delay_minutes)
#   }

get_lagvars <- function(dat_njstation_row){
    #print(dat_njstation_row)
    cur_date <- dat_njstation_row$date
    cur_train_id<- dat_njstation_row$train_id
    cur_from <- dat_njstation_row$from
    cur_to <- dat_njstation_row$to

    dat_2months_filter <- dat_2months %>%
      filter(
        date %in% c(cur_date - 1, cur_date - 2, cur_date - 7, cur_date - 14),
        train_id == cur_train_id, from == cur_from, to == cur_to) %>%
      select(date, delay_minutes)

    lag_1day <- dat_2months_filter %>%
      filter(date == cur_date - 1) %>%
      select(delay_minutes)

    if (dim(lag_1day)[1] == 0) { lag_1day <- NA }
    else { lag_1day <- lag_1day[[1]] }

    lag_2day <- dat_2months_filter %>%
      filter(date == cur_date - 2) %>%
      select(delay_minutes)

    if (dim(lag_2day)[1] == 0) { lag_2day <- NA }
    else { lag_2day <- lag_2day[[1]] }

    lag_7day <- dat_2months_filter %>%
      filter(date == cur_date - 7) %>%
      select(delay_minutes)

    if (dim(lag_7day)[1] == 0) { lag_7day <- NA }
    else { lag_7day <- lag_7day[[1]] }

    lag_14day <- dat_2months_filter %>%
      filter(date == cur_date - 14) %>%
      select(delay_minutes)

   if (dim(lag_14day)[1] == 0) { lag_14day <- NA }
    else { lag_14day <- lag_14day[[1]] }

    return(list("lag_1day" = lag_1day, "lag_2day" = lag_2day,
                "lag_1week" = lag_7day, "lag_2week" = lag_14day))
}

#sampled_dat_njstation <- sample_n(dat_njstation, 10)

#library(plyr)
#Uncomment first time
#dat_lag <- apply(dat_njstation, 1, get_lagvars)
#result <- as.data.frame(do.call(rbind, dat_lag))

#write.csv(as.matrix(result %>% select(lag_1week, lag_2week)), file = "lag_variables_week.csv")
#write.csv(as.matrix(result), file = "lag_variables.csv")
#write.csv(as.matrix(dat_njstation), file = "dat_njstation.csv")
lag_features <- read.csv(file = "lag_variables.csv",row.names=1)
dat_njstation <- cbind(dat_njstation, lag_features)
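The row-wise lag computation above issues one filter per record; an equivalent and much faster approach is a self-join, where the delay table is date-shifted and merged back onto itself on the trip key. A minimal base-R sketch on toy data (the `trips` data frame and `add_lag` helper are illustrative, not part of the pipeline):

```r
# Toy stand-in for dat_2months: one record per (date, train_id, from, to).
trips <- data.frame(
  date = as.Date("2018-04-01") + c(0, 1, 7, 8),
  train_id = "3847", from = "NWK", to = "NYP",
  delay_minutes = c(2, 4, 6, 8)
)

# To attach "delay of the same trip d days ago", shift the table forward by d
# days and left-join it back onto itself on the full trip key.
add_lag <- function(df, d, name) {
  lagged <- df[c("date", "train_id", "from", "to", "delay_minutes")]
  lagged$date <- lagged$date + d
  names(lagged)[names(lagged) == "delay_minutes"] <- name
  merge(df, lagged, by = c("date", "train_id", "from", "to"), all.x = TRUE)
}

trips <- add_lag(trips, 1, "lag_1day")   # NA where no trip ran the day before
trips <- add_lag(trips, 7, "lag_1week")
```

Rows with no matching prior trip come out as NA, matching the behavior of `get_lagvars()`.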

The plots below show the delay as a function of the daily, weekly and bi-weekly lags. As we can observe, the points are nowhere close to the line, so transformations may help. A log transform introduces many -Inf values due to the presence of 0s in the lags.

dat_njstation <- dat_njstation %>%
  mutate(l1dlog = log(lag_1day),
         l1wlog = log(lag_1week),
         l2wlog = log(lag_2week),
         delog = log(delay_minutes),
         l1ds = lag_1day*lag_1day,
         l1ws = lag_1week*lag_1week,
         l2ws = lag_2week*lag_2week)
plotData.lag <- as.data.frame(dat_njstation) %>%
  filter(week == 17) %>%
  #dplyr::select(lag_1day, lag_1week, lag_2week, delay_minutes) %>%
  dplyr::select(l1ds, l1ws, l2ws, delay_minutes) %>%
  gather(Variable, Value, -delay_minutes) %>%
  mutate(Variable = fct_relevel(Variable, "l1ds", "l1ws", "l2ws"))

#Correlation scatter plots
plotData.lag %>% 
  ggplot(aes(Value, delay_minutes)) +
  geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Delay per station as a function of time lags",
       subtitle = "One week in April, 2018") +
  plotTheme

Critical Station variable

This variable captures how many distinct trains pass through a particular station. Stations that are part of many lines, such as Hoboken, have a large number of trains passing through them, which may cause delay or congestion if not handled well.

# njstation_critical_station <- dat_njstation %>%
#   group_by(to, train_id) %>%
#   dplyr::summarise(critical = n()) %>%
#   ungroup()

dat_njstation <- dat_njstation %>%
  group_by(to) %>%
  mutate(critical = n_distinct(train_id)) %>%
  ungroup()

ggplot() +
  geom_histogram(data = dat_njstation, aes(critical))

Congestion variable

This variable is slightly different from the one created above, as it deals with the number of trains that cross a particular station in a particular hour. It captures hourly congestion, a more granular measure.

njstation_cong <- dat_njstation %>%
  group_by(to, interval60) %>%
  dplyr::summarise(congestion = n()) %>%
  ungroup()

dat_njstation <- dat_njstation %>% left_join(njstation_cong)

Number of trains in every station on 15th April, 2018

one_day <- 
  dat_njstation %>%
  filter(date == as_date('2018-04-15'))

one_day.panel <-
  expand.grid(
    interval60 = unique(one_day$interval60),
    to = unique(one_day$to))

ride.animation.data <-
  mutate(one_day, Congestion_Counter = 1) %>%
  right_join(one_day.panel) %>% 
  group_by(interval60, to, LATITUDE, LONGITUDE) %>%
  dplyr::summarize(Congestion = sum(Congestion_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  na.omit() %>%
  st_as_sf(., coords = c("LONGITUDE","LATITUDE"), crs = 3701) %>%
  st_set_crs(st_crs(subset_census)) %>%
  mutate(Trains = case_when(Congestion == 0 ~ "0 trains",
                           Congestion > 0 & Congestion <= 2 ~ "1-2 trains",
                           Congestion > 2 & Congestion <= 4 ~ "3-4 trains",
                           Congestion > 4 & Congestion <= 6 ~ "5-6 trains",
                           Congestion > 6 ~ "7+ trains")) %>%
  mutate(Delays = factor(Trains))

rideshare_animation <-
  ggplot() +
  geom_sf(data = subset_census) +
  geom_sf(data = ride.animation.data, aes(color = Trains)) +
  scale_fill_manual(values = palette3) +
  labs(title = "Number of trains in every station for one day in April 2018",
       subtitle = "1 hour intervals: {current_frame}") +
  transition_manual(interval60)

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

anim_save("trains_local", rideshare_animation, duration=20, renderer = gifski_renderer())

Creating custom dependent variable - diff_delay

Delay is a cumulative quantity: the delay at a particular station is typically the sum of all delays accumulated at previous stops, unless the operator makes up time between stations, for instance by speeding up. We therefore create a diff_delay variable, the incremental delay added on each segment, that can also be used in the prediction.

#dat_njstation <- arrange(dat_njstation, date, train_id, stop_sequence)

# for(i in 1:nrow(dat_njstation)){
#     if((i %% 20000) == 0){
#       print(i)
#     }
#     row <- dat_njstation[i,]
#     if(row$from == row$to){
#       dat_njstation$diff_delay[i] = 0
#     }
#     else{
#       prev_row <- dat_njstation[i-1,]
#       dat_njstation$diff_delay[i] = row$actual_time - prev_row$actual_time - 
#         (row$scheduled_time - prev_row$scheduled_time)
#     }
# }

#write.csv(as.matrix(dat_njstation %>% select(diff_delay)), file = "diff_delay_variable.csv")
diff_delay_var <- read.csv("diff_delay_variable.csv")
a <- diff_delay_var$diff_delay
dat_njstation <- cbind(dat_njstation, as.data.frame(a)) # diff_delay is bound under the column name "a"
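As a sanity check on the definition, the incremental delay for a single segment is the actual travel time minus the scheduled travel time, exactly as in the commented loop above. A toy example with made-up timestamps:

```r
# Scheduled and actual times at two consecutive stops (illustrative values).
sched  <- as.POSIXct(c("2018-04-01 08:00", "2018-04-01 08:10"), tz = "UTC")
actual <- as.POSIXct(c("2018-04-01 08:03", "2018-04-01 08:18"), tz = "UTC")

# diff_delay: extra minutes taken on this segment beyond the schedule.
diff_delay <- as.numeric(difftime(actual[2], actual[1], units = "mins")) -
  as.numeric(difftime(sched[2], sched[1], units = "mins"))
diff_delay  # 5: the train lost five more minutes between these two stops
```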

We set the NAs in the lag variables to 0 so that they do not interfere with training and testing; more sophisticated imputation methods could be used instead.

dat_njstation$lag_1day[is.na(dat_njstation$lag_1day)] <- 0
dat_njstation$lag_1week[is.na(dat_njstation$lag_1week)] <- 0
dat_njstation$lag_2week[is.na(dat_njstation$lag_2week)] <- 0

Building Model

We have tried two linear regression models and a random forest model to predict train delays.

Remove Outliers and Split Data into Train and Test

The 98th percentile of delay is 19.38 minutes, so we remove values higher than 20 minutes to keep outliers from interfering with the predictions. Weeks 13 to 16 form the training set and the rest form the test set.

#98th percentile is 19.38
dat_njstation <- dat_njstation %>% filter(delay_minutes < 20)
dat.Train <- filter(dat_njstation, week < 17)
dat.Test <- filter(dat_njstation, week >= 17)

The plot below shows the mean delay across all train lines for the training and test sets. We can see that the pattern is more or less the same (roughly repeating weekly), yet there is a lot of randomness.

rbind(
  mutate(dat.Train, Legend = "Training"), 
  mutate(dat.Test, Legend = "Testing")) %>%
    group_by(Legend, interval60) %>% 
      dplyr::summarize(Trip_Delay = mean(delay_minutes)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, Trip_Delay, color = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        labs(title="Mean Delay across all train lines",
             x="Day", y="Trip Delay (minutes)") +
        plotTheme + theme(panel.grid.major = element_blank())

The mean delay for each of the train lines is plotted in one graph. Such visualizations help users understand which lines have more delays, helping them choose other lines if such an option is available.

dat_njstation %>%
  group_by(line, date) %>% 
  dplyr::summarize(Trip_Delay = mean(delay_minutes)) %>%
  ungroup() %>% 
  ggplot(aes(date, Trip_Delay, color = line)) + geom_line() +
        #scale_colour_manual(values = palette2) +
  labs(title="Mean Delay across each train line",
        x="Day", y="Trip Delay (minutes)") +
  plotTheme + theme(panel.grid.major = element_blank())

Users can also be given the option to view the mean delay for a particular line, as shown below (e.g. the Northeast Corridor).

dat_njstation %>%
  group_by(line, date) %>% 
  filter(line == "northeast corrdr") %>%
  dplyr::summarize(Trip_Delay = mean(delay_minutes)) %>%
  ungroup() %>% 
  ggplot(aes(date, Trip_Delay, color = line)) + geom_line(size=1) +
  # scale_x_date(
  #   breaks = function(x) seq.Date(from = min(x), to = max(x), by = "5 days"),
  #   minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 day")) +
        #scale_colour_manual(values = palette2) +
  labs(title="Mean Delay over April for Northeast Corridor Line",
        x="Day", y="Trip Delay (minutes)") +
  plotTheme

Similarly, bar plots showing the mean delay on every day of the week can help users gauge the delay on the particular day they wish to travel.

dat_njstation %>%
  filter(to == "New York Penn Station") %>% 
  group_by(dotw) %>%
  dplyr::summarise(Trip_Delay = mean(delay_minutes)) %>%
  ungroup() %>%
  ggplot(aes(dotw, Trip_Delay)) + geom_col(fill="blue") +
  plotTheme + labs(title="Mean delay for every day of the week")

Training model

The models are trained with the actual delay_minutes as the dependent variable. The independent variables finally considered are day of the week, from and to stations, train ID, total stops on the route, hour of the day, wind speed, temperature, precipitation, distance between stations, daily lag, weekly lag, bi-weekly lag, hourly congestion at a station, the critical-station count (number of distinct trains through a station) and the differenced delay.

ind.vars.org <-  c("dotw", "from", "to","hour", "Wind_Speed", "Temperature", "Precipitation")

ind.vars.imp <- c("dotw", "from", "to", "train_id", "total_stops", "hour", "Wind_Speed", "Temperature",
               "Precipitation", "dist", "lag_1day", "lag_1week", "lag_2week",
               "congestion", "critical", "a") # "a" holds the diff_delay values

dep.var <- "delay_minutes"

Linear Regression model

The linear regression model is run on two different sets of variables to show how the inclusion of the newly created features improves predictions, though not enough for real-world use. The results of the regression are shown below; the maximum variance the models explain is about 26%, which is quite low.

reg_base <- lm(delay_minutes ~ ., na.action = na.exclude,
           data = dat.Train %>% select(all_of(c(ind.vars.org, dep.var))))

print("R-Squared values for Regression Model")
## [1] "R-Squared values for Regression Model"
summary(reg_base)$r.squared
## [1] 0.1093728
summary(reg_base)$adj.r.squared
## [1] 0.1074544
reg <- lm(delay_minutes ~ ., na.action = na.exclude,
           data = dat.Train %>% select(all_of(c(ind.vars.imp, dep.var))))

print("R-Squared values for Regression with Improved Features")
## [1] "R-Squared values for Regression with Improved Features"
summary(reg)$r.squared
## [1] 0.2616486
summary(reg)$adj.r.squared
## [1] 0.2551312

Random Forest model

As the results from the previous model are not satisfactory, we now use a random forest model with 50 trees. The variance explained in this case is around 45%, an improvement over the previous model.

library(randomForest)
require(caTools)
library(Metrics)
rf <- randomForest(delay_minutes ~ ., na.action = na.exclude,
                   data = dat.Train  %>% select(ind.vars.imp, dep.var),
                   method = "parRF", ntree = 50) # note: randomForest() ignores the caret-style method argument
summary(rf$mse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.431   5.544   5.795   6.421   6.600  11.110
rf
## 
## Call:
##  randomForest(formula = delay_minutes ~ ., data = dat.Train %>%      select(ind.vars.imp, dep.var), method = "parRF", ntree = 50,      na.action = na.exclude) 
##                Type of random forest: regression
##                      Number of trees: 50
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 5.430831
##                     % Var explained: 45.3

Predicting for test data

We nest the test data by week for the prediction step.

dat.Test.weekNest <- dat.Test %>% 
  select(ind.vars.imp, week, interval60, delay_minutes, LATITUDE, LONGITUDE) %>%
  nest(-week) 

When we run our predictions and summarize the results, there will be some NA values, due to the lag variables.

model_pred <- function(dat, fit){
   predict(fit, newdata = dat, na.action = na.exclude)
}
week_predictions <- 
  dat.Test.weekNest %>% 
    mutate(
          A_Base = map(.x = data, fit = reg_base, .f = model_pred),
          B_Regression = map(.x = data, fit = reg, .f = model_pred),
          C_RandomForest = map(.x = data, fit = rf, .f = model_pred),
    ) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, delay_minutes),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           SquareAbsErr = map2(Absolute_Error, Absolute_Error,  ~ abs(.x*.y)),
           AbsPercErr = map2(Absolute_Error, Prediction,  ~ abs(.x/.y)),
           MSE = map_dbl(SquareAbsErr, mean, na.rm = TRUE),
           sd_SE = map_dbl(SquareAbsErr, sd, na.rm = TRUE),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE),
           MAPE = map_dbl(AbsPercErr, mean, na.rm = TRUE),
           )

The table below shows the results of all the models we have built, using the two sets of variables. The second set, consisting of the improved features, performs better on MSE and MAE in every week; the best model is the random forest.

kable(week_predictions %>% 
         select(-Absolute_Error, -SquareAbsErr, -Prediction, -Observed, -data, -AbsPercErr),
      caption = "Test results for all models"
) %>% kable_styling()
Test results for all models
week Regression MSE sd_SE MAE sd_AE MAPE
17 A_Base 9.477535 25.78946 2.090218 2.260230 8.604984e-01
18 A_Base 8.733182 22.26088 2.044132 2.134314 7.596364e-01
17 B_Regression 8.406730 23.74364 1.942643 2.152434 1.248790e+00
18 B_Regression 7.511964 18.72825 1.923242 1.952844 8.648842e-01
17 C_RandomForest 7.512557 21.69759 1.843929 2.027949 6.998653e+10
18 C_RandomForest 7.314431 19.88533 1.836984 1.985052 7.290114e-01
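The exploding week-17 MAPE for the random forest is an artifact of the percentage-error definition used above: AbsPercErr divides the absolute error by the prediction rather than the observation, so it blows up whenever a prediction is near zero. A toy illustration (made-up values):

```r
observed  <- c(5, 3, 4)
predicted <- c(4, 0.0001, 5)

# Dividing by the prediction (as AbsPercErr does) explodes near zero ...
mape_by_pred <- mean(abs(observed - predicted) / predicted)

# ... while the conventional MAPE divides by the observed value.
mape_by_obs <- mean(abs(observed - predicted) / observed)
```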

Cross validation

We perform leave-one-group-out (LOGO) cross-validation across time, holding out one hour of the day at a time.

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    #cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(indVariables, dependentVariable)
    regression <-
      lm(delay_minutes ~ ., data = fold.train)
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response", na.action = na.pass))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(allPredictions)
}

# Helper function: report per-column NA counts for columns that contain NAs
find_na_rows <- function(data){
  M <- sapply(data, function(x) sum(is.na(x)))
  M[M > 0]
}

The train_id feature is removed to account for factor levels that appear in a held-out fold but are missing from the training folds.

dat_njstation <- dat_njstation %>% 
  mutate(dotw_id = wday(scheduled_time))

ind.vars <- c("dotw_id", "from", "to", "total_stops", "hour", "Wind_Speed",
                "Temperature", "Precipitation", "dist", "lag_1day", "lag_1week", "lag_2week",
                "congestion", "critical")

reg.cv <- crossValidate(
  dataset = dat_njstation,
  id = "hour",
  dependentVariable = "delay_minutes",
  indVariables = ind.vars)

The distribution of MAE gives us an idea of the goodness of the model. We can see that the MAEs are on the lower side, mostly less than 0.5, with a few close to 0.6.

error_by_reg_and_fold <- 
  reg.cv %>%
    group_by(hour) %>% 
    summarize(Mean_Error = mean(Prediction - delay_minutes, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = sd(abs(Prediction - delay_minutes), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    #geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "LOGO-CV by hour of the day",
         x="Mean Absolute Error", y="Count") 

Error Analysis

MSE Errors by model and week - Accuracy

We see that the MSEs decrease as more features are added and the model is improved.

week_predictions %>%
  dplyr::select(week, Regression, MSE) %>%
  gather(Variable, MSE, -Regression, -week) %>%
  ggplot(aes(week, MSE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette3) +
    labs(title = "Mean Squared Errors by \nmodel specification and week") +
  plotTheme

Error across Time

From the plot below, we can observe that the linear models do not capture all the variation, but the random forest model improves the forecast considerably. The peaks and dips are captured well, and most points overlap, showing that the predictions are actually good: the delay forecast is almost the same as the observed values.

week_predictions_pull <-  week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           to = map(data, pull, to), 
           start_lat = map(data, pull, LATITUDE), 
           start_lon = map(data, pull, LONGITUDE),
           dotw = map(data, pull, dotw))

week_predictions_pull %>%
    dplyr::select(interval60, to, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -to) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed Train delay time series",
           subtitle = "NJ Transit; A test set of 2 weeks",  x = "Hour", y= "Delay") +
      plotTheme

Errors across Space

When we look at our mean absolute errors by station, there seems to be a spatial pattern to the error. The central region has the highest error compared to outlying regions, and the Northeast Corridor line has the highest error, suggesting that the features are not helping much for that particular line.

week_predictions_pull %>%
    select(interval60, to, start_lat, start_lon, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "C_RandomForest") %>%
  group_by(to, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = subset_census , color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.8) +
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D") +
  labs(title="Mean Absolute Error, Test Set, RandomForest")

Correlation in Errors across Time

If we plot observed vs. predicted delays for different times of day during the week and weekend, some patterns begin to emerge. We are certainly underpredicting in general.

week_predictions_pull %>%
    select(interval60, to, start_lon, start_lat, Observed, Prediction, Regression, dotw) %>%
    unnest() %>%
  filter(Regression == "C_RandomForest")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed Delays", 
       y="Predicted Delays") +
  plotTheme

Error across Space and Time

There is some spatial pattern (along rail lines) observable in the MAEs, but delays vary in ways the model does not capture, causing high forecasting errors. The absence of a clear spatial or temporal pattern makes it difficult to systematically reduce the prediction errors.

week_predictions_pull %>%
    select(interval60, to, start_lon, start_lat, Observed, Prediction, Regression, dotw) %>%
    unnest() %>%
  filter(Regression == "C_RandomForest")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(to, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>%
  ggplot(.)+
  geom_sf(data = subset_census, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 1.5, alpha = 0.8)+
  scale_colour_viridis(direction = -1, discrete = FALSE, option = "D")+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")

Generalizability - Socio-Economic Factors

MAE increases with increasing median income and percent white population, but appears to be negatively related to the percentage of people taking public transit.

week_predictions_with_census <- 
  week_predictions_pull %>% 
  unnest() %>%
  filter(Regression == "C_RandomForest") %>%
  mutate(LATITUDE = start_lat, LONGITUDE = start_lon) %>%
  st_as_sf(., coords = c("LONGITUDE","LATITUDE"), crs = 3701) %>%
  st_set_crs(st_crs(subset_census)) %>% 
  st_join(subset_census, left = TRUE) 

week_predictions_with_census %>%
  select(interval60, to, start_lon, start_lat, Observed, Prediction, Regression,dotw,
         Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  st_drop_geometry(.) %>% 
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(to, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>%
  gather(-to, -MAE, key = "variable", value = "value") %>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

Conclusion

We can see that train delays can be predicted to a fairly good extent with the use of the right variables: a limited set of variables with a basic random forest algorithm accounts for 45% of the variance. Our analysis and predictive model, in the form of an app, will give users the power of advanced analytics to help them travel efficiently. Under-predictions may still mean that passengers have to wait at certain stations or arrive late at their destination, but the wait time is reduced, and because a delay was predicted, they will not be overly agitated, as they were already expecting it. Over-predictions, however, would cause havoc, with people missing trains.

This model can definitely be improved. One way would be to take into account more factors that affect train journeys. As already discussed, some factors are quite unpredictable, but features linked to the occurrence of such events may be considered where available. Some of the reasons/excuses provided by rail authorities for train delays are listed on the NJ Transhit website (https://njtranshit.com/excuses); features related to those causes could be added and tested to see whether they improve the results. We could also increase the number of trees in the random forest, given a sufficiently powerful CPU: because the data and feature set are large, training the random forest takes a long time for larger numbers of trees.

We have only included the statistics and features for the 'To' station and not the 'From' station. Including a similar set of variables for the station from which the train departs could add information, accounting for delay attributable to the departing station as well.

Even after considering those cases, it is hard to predict every delay that may happen, as some of the problems trains face are sudden. The limitations of such models/applications therefore need to be understood, so that there is no completely blind reliance on them.